title: Content Addressed Vocabulary
date: 2020-02-26 19:23
author: Christine Lemmer-Webber
tags: standards, activitypub, json-ld
slug: content-addressed-vocabulary
---

How can systems communicate and share meaning?
Communication within systems is preceded by a form of meta-communication;
we must have a sense that we mean the same things by the terms we use
before we can even use them.

This is challenging enough for humans who must share meaning, but we
can resolve ambiguities with context clues from a surrounding narrative.
Machines, in general, need a context more explicitly laid out for them,
with as little ambiguity as possible.

Standards authors of open-world systems have long struggled with such
systems and have come up with some reasonable systems; unfortunately
these also suffer from several pitfalls.
With minimal (or sometimes none at all) adjustment to our tooling,
I propose a change in how we manage ontologies.


# How we deal with ambiguous terms today

Consider [Note](https://www.w3.org/TR/activitystreams-vocabulary/#dfn-note),
a seemingly simple term in
[ActivityStreams](https://www.w3.org/TR/activitystreams-vocabulary/),
the vocabulary used by [ActivityPub](https://www.w3.org/TR/activitypub/).
The meaning of `Note`, as described by the ActivityStreams vocabulary,
seems simple enough:
`Represents a short written work typically less than a single paragraph in length.`

Here is how an ActivityStreams usage of Note might look (a bit
simplified from what it would probably look like in practice):

``` javascript
  {"@context": "https://www.w3.org/ns/activitystreams",
   "@type": "Note",
   "content": "Would you read me a bedtime story about the great ontology wars?"}
```

What's that `@context` thing?
This is some [JSON-LD](https://www.w3.org/TR/json-ld/) thing, which
tries to be "more exact" about what `Note` we must be talking about.
It does so by mapping `Note` to `https://www.w3.org/ns/activitystreams#Note`
by something like the following:

```javascript
  {"as": "https://www.w3.org/ns/activitystreams#",
   "Note": "as:Note",
   "content": "as:content",
   ...}
```

The choice to use JSON-LD has been semi-controversial in ActivityPub
land; historically there was some debate about whether or not we
needed to be "more exact" at all as to what terms mean.
This post really isn't about JSON-LD as much as it is the more general
topic of vocabularies and vocabulary mapping systems.
There are other concerns people raise about JSON-LD, usually around
the tooling... that's not the scope of this post.
This blogpost could as easily apply to XML or Turtle or whatever;
the protocol I've worked on just happens to use JSON-LD to do that,
so I've used it as my illustration.

That said, the ActivityPub spec tries to make things as simple as
possible for the default case of ActivityPub usage by saying that the
ActivityStreams context is implied, so that if you're not doing anything
complicated, so:

``` javascript
  {"@type": "Note",
   "content": "Would you read me a bedtime story about the great ontology wars?"}
```

... is really the same as the first example.

So okay, probably everyone can guess what `Note` means, but what about
`sensitive`?
What the heck is that?
It doesn't appear in the ActivityStreams vocabulary; it kind of
implies something along the lines of content-warning type behavior, like
"this content may be considered sensitive" by some users, but how would
you guess that just by the term?
This is an *extension*, and it lives at `http://joinmastodon.org/ns#sensitive`.

So maybe if we were going to use it (and if we inline our context) it
might look like:

``` javascript
  {"@context": {"as": "https://www.w3.org/ns/activitystreams#",
                "toot": "http://joinmastodon.org/ns#",
                "Note": "as:Note",
                "content": "as:content",
                "sensitive": "toot:sensitive"},
   "@type": "Note",
   "content": "Would you read me a bedtime story about the great ontology wars?",
   "sensitive": true}
```

(I mean, the Great Ontology Wars are a sensitive topic for some.)

The choice of JSON-LD in ActivityPub is controversial for various
reasons.
But it turns out what isn't really controversial anymore is whether we
need some way of being more exact about the way we speak about terms...
those who used to complain about that mostly now agree (disagreements
then surround what tooling need to be used to do so (not in scope of this
post), and namespace governance (in scope of this post)).

Maybe you feel like, having heard what `sensitive` and `Note` mean,
these are the obvious definitions.
But consider that `Note` itself could have meant something very
different.
Are we talking about a short mostly-textual post (probably on a microblog),
as ActivityStreams does?
Are we talking about a musical note?
Are we instructing someone to take note of something, as an action (or
yes, activity)?

So terms really are ambiguous, and in a decentralized but extensible
system with
[open world assumptions](https://en.wikipedia.org/wiki/Open-world_assumption),
we are eventually going to result in conflicts.
The choice to map our vocabulary to
[URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)
is actually a very reasonable way to reduce ambiguity.
Unfortunately, the choice to map them to namespaces and to *live* URIs
(a-la `http(s):` URIs), is a mistake that will eventually bite us
(and doubly so for JSON-LD contexts).


# Problems appear

The first problem with choosing to put our terminology URIs at HTTP(S)
URIs is that it assumes that those vocabularies will remain alive.
Perhaps popular ones shall, but really the modern web rots all the time.
Soon enough, many ontologies will eventually be replaced by Viagra ads.

The problem is dramatically worse for json-ld contexts (and similar
documents such as XML DTDs): these are the very documents by which we
map terms to their fully defined meanings.
Servers get hammered by people looking up contextual mappings.
This is no good already.
It gets even worse when such documents add (or otherwise amend) their
terminology mappings; old documents may suddenly mean different things!

(I'd be remiss to not note here that vocabulary namespaces and json-ld
contexts are frequently the same URIs and yet frequently not the same
thing.
Still, they share a lot of the same problems and solutions in terms of
liveness.)

Furthermore, both the choice to put terms in namespaces and the choice
to have common contextual URIs that can change creates governance
problems.

I know this from personal experience (and by that I mean many painful
hours of my life wasted that I can never get back).
Consider `sensitive` above.
The Mastodon folks created their own namespace, as previously mentioned,
but they didn't really *want to*.
The good news was that the
[Social Web Community Group](https://www.w3.org/wiki/SocialCG)
was given permission to both extend the ActivityStreams vocabulary and
the official ActivityStreams context.

Despite the entire group agreeing that it made sense to make `sensitive`
official in some way (which does not mean everyone agreed that it was a
good term, just that it was in enough usage that we should make it more
easily widely available), the SocialCG got tied up for months and months
in meetings being unable to make progress about *how* to do so:

* Should we add `sensitive` to the ActivityStreams namespace, or leave
  it in the old namespace but "officially sanction" it?
* What is the migration path for software using the previous term URI?
* How often should we do this?  What is the governance process for
  incubating a *new* term?  Should it happen in a separate namespace
  first and then get "pulled in" later?
* What would happen if we didn't for terms like these, and the sites
  went down?
* If we also update the json-ld context, what happens for documents
  that already had `sensitive` in them meaning either the old
  URI or a new one?  This can have significant impact on normalization
  for signature verification.

The group met for months about all the topics above and came to no
conclusions.
Eventually we decided that no consensus could be reached, so instead
no action was taken at all.
What a disappointment.

In general, this seems to be common.
Ironically, it leads to otherwise nice decentralized designs for
vocabularies eventually ending up centralized in something like
[schema.org](https://schema.org/) anyway.


# Content addressed vocabularies (and contexts) are the answer

My friend Sandro Hawke offered a solution, which I initially rejected
as terrible, decided upon further consideration was brilliant, and
fully embraced.
Then Sandro explained to me that I had totally misunderstood him,
and that he meant [something different](https://sandhawke.github.io/mov/).
It turns out that I actually think my initial misunderstanding was the
right answer.

Here's what I understood Sandro to say:

> The name we choose for a term doesn't matter that much.
> What really matters is the paragraph or so of specification language
> that describes the term.
> If two implementations refer to the same specification text, they
> mean the same thing.
> So just use that as the description.

Once I (incorrectly) came to realize that this could mean naming via
[content addressing](https://en.wikipedia.org/wiki/Content-addressable_storage),
I latched onto the idea.
Of course!
We had merely selected the wrong edge of
[Zooko's triangle](https://en.wikipedia.org/wiki/Zooko%27s_triangle).
But we know how to fix that sort of thing.

Here's how it works.
Let's remember the specification text for Note above:
`Represents a short written work typically less than a single paragraph in length.`
Let's hash that (along with a "recommendation" prefix that a user might choose
to bind this to the term `Note`, though this is just a recommendation):

```
$ echo "Note: Represents a short written work typically less than a single paragraph in length." | sha256sum
3e1de3b56d2dc1bee7313963462691f9a8f46b068557b75e0e0d14c0994eddc6
```

So if we were defining `Note` via content-addressing, we instead would
have defined it as
`urn:sha256:3e1de3b56d2dc1bee7313963462691f9a8f46b068557b75e0e0d14c0994eddc6`.
This is unambiguous enough to avoid collisions with other uses of the
word "Note".
But note that it doesn't require any servers staying up.
It also doesn't have any namespace governance quagmire, because there
is no namespace.
Updates can be handled the usual way, via errata (translations can be
handled similarly), and standards organizations can still publish such
things... but it is important that the original term remain content-addressed
and immutable.
(Hash migration is left as an exercise for the user, with a hint that
the solution is similar to that with errata.)

Anyway, our post might end up looking in the end like this instead:

``` javascript
  {"@context": {"Note": "urn:sha256:3e1de3b56d2dc1bee7313963462691f9a8f46b068557b75e0e0d14c0994eddc6",
                "content": "urn:sha256:57dc44a1cdcbb7aa976a65a858b4d349ad6110d58d9d546650ce2b0e2b1048e4",
                "sensitive": "urn:sha256:81d98cf83fcf733400ad5d2a25495feeea47f287193a53a9722f4cb025da88f1"},
   "@type": "Note",
   "content": "Would you read me a bedtime story about the great ontology wars?",
   "sensitive": true}
```

I'll note very briefly that content-addressing is also the answer for
JSON-LD contexts.
If something like [Datashards](https://datashards.net/) or
[IPFS](https://ipfs.io/) were used to host json-ld contexts, each post
could link to the exact immutable content-addressed context it was
intended to be used with.
Servers that use such contexts can "pin" them to keep them available,
avoiding a single point of failure (or bandwidth bottleneck).

``` javascript
  {"@context": "idsc:p0.JLnUcJN4R1KNvSXm9Ut3Tmg7WfXAKEOx47p01Pk_Htw.2_rCdtnEha1RpD_qyzxhFIjUvLj7crIbzpmzWei5xRk",
   "@type": "Note",
   "content": "Would you read me a bedtime story about the great ontology wars?",
   "sensitive": true}
```

As one other side-note, I'll also observe that even though the fully
expanded version of the above message is:

``` javascript
  {"@type": "urn:sha256:3e1de3b56d2dc1bee7313963462691f9a8f46b068557b75e0e0d14c0994eddc6",
   "urn:sha256:57dc44a1cdcbb7aa976a65a858b4d349ad6110d58d9d546650ce2b0e2b1048e4": "Would you read me a bedtime story about the great ontology wars?",
   "urn:sha256:81d98cf83fcf733400ad5d2a25495feeea47f287193a53a9722f4cb025da88f1": true}
```

... we never needed to look at it that way because json-ld contexts
(and systems like them) are actually
[petname systems](https://github.com/cwebber/rebooting-the-web-of-trust-spring2018/blob/petnames/draft-documents/petnames.md).


# Conclusions (and non-conclusions)

Let me clarify a claim I'm not making: we don't need to throw away the
old terms for systems like ActivityStreams that are already well
understood.
However, going forward I do think that using content-addressing of new
terms is a good idea.
And in the long run, I think content-addressing of json-ld contexts
and any documents like them is an absolute must (when they aren't
inlined, anyway... but inlining is expensive).

If we adopted Content Addressed Vocabularies, working on vocabulary
extensions to ActivityPub could be a different story.
Imagine a git repository that communities can fork to work on new terms.
We could have a `drafts` directory where people hammer out common
extension terms, and when they're ready, we simply move them to the
`extensions` directory.
Since the names are merely hashes of the contents of that directory,
statically generating a webpage that lists all current known and
recommended extensions would be trivial.
Everything could be handled in issues and PRs, and even if terms
aren't merged into the main repo, that's merely a matter of lower
term discoverability rather than a hinderance of application itself.

If we moved to content addressed vocabulary, we'd be more free from
the perils of downtime and general web bitrot, freer from gatekeeping
and governance challenges, but just as free (I'd argue even freer) to
collaborate.
Moving forward, I intend to ake content addressed approaches to terms
I define in my systems, and I encourage you to do the same.
